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Abstract.  Graph  partitioning  problems  have  a  wide  range  of  applications  in  machine  learning. 
This  work  analyzes  convergence  conditions  for  a  class  of  diffuse  interface  algorithm  [A.L.  Bertozzi 
and  A.  Flenner,  Multiscale  Modeling  &  Simulation ,  10(3):  1090  1118,  2012]  for  binary  and  multi¬ 
class  partitioning.  Using  techniques  from  numerical  PDE  and  convex  optimization,  convergence 
and  monotonicity  are  shown  for  a  class  of  schemes  under  a  graph-independent  timestep  restriction. 
We  also  analyze  the  effects  of  spectral  truncation,  a  common  technique  used  to  save  computational 
cost.  Convergence  of  the  scheme  with  spectral  truncation  is  also  proved  under  a  timestep  restriction 
inversely  proportional  to  the  size  of  the  graph.  Moreover,  this  restriction  is  shown  to  be  sharp  in 
a  worst  case.  Various  numerical  experiments  are  done  to  compare  theoretical  results  with  practical 
performance. 
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1.  Introduction.  Graph  cut  methods  have  been  widely  used  in  data  clustering 
and  image  segmentation  [9,11,15].  Recently,  the  reformulation  of  graph  cut  problems 
in  graph  total  variation  (TV)  minimization  has  lead  to  various  fast  approximations 
for  the  optimization  [7,8].  In  particular,  a  method  inspired  by  diffuse  interface  models 
in  PDE  was  proposed  [5].  This  approach  has  been  applied  to  various  applications  in 
clustering,  image  segmentation,  and  image  inpainting  [21,23,28]. 

Classical  diffuse  interface  models  are  built  around  the  Ginzburg-Landau  func¬ 
tional  in  Euclidean  space,  defined  as 

(1.1)  GL(u)  =  ^J  |Vu|2  +  i  J  W(u(x))dx. 

W  is  the  double- well  potential  W(u)  =  \{u2  —  l)2,  with  minimizers  1  and  —1.  The 
first  term  is  the  H 1  semi-norm  of  the  function  u,  which  penalizes  non-smoothness  of 
u.  The  parameter  e  controls  the  scale  of  the  diffuse  interface,  namely,  the  sharpness 
of  the  transition  between  two  phases.  The  Ginzburg-Landau  functional  and  TV  is 
related  through  the  notion  of  gamma-convergence  [30]: 

YivaGLJu)  — CTV(u). 

e— f  0  ' 

Evolution  by  the  Ginzburg-Landau  functional  has  been  used  to  model  dynamics  of 
two  phases  in  material  science  [14, 16].  The  most  common  of  which  is  the  Allen-Cahn 
equation,  the  L2  gradient  flow  on  the  Ginzburg-Landau  functional.  Under  suitable 
rescaling,  the  Allen-Cahn  equation  has  been  shown  to  converge  to  motion  by  mean 
curvature  [25,36],  and  thus  fast  numerical  solvers  for  motion  by  mean  curvature  such 
as  the  MBO  scheme  [29]  could  be  utilized  to  further  approximate  the  Allen-Cahn 
equation. 

The  class  of  methods  that  we  study  here  are  graph  analogues  of  these  classical 
PDE  models.  Graph  Laplacian  and  graph  TV  are  used  in  the  place  of  their  classical 
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counterparts,  and  a  comprehensive  list  of  these  correspondences  could  be  found  in  [36]. 
Graph  TV  minimization  and  the  discrete  graph  cut  problem  have  been  shown  to  be 
equivalent  in  [27] .  Moreover,  gannna-convergence  of  graph  Ginzburg-Landau  to  graph 
TV  is  proved  in  [35],  justifying  its  use  to  approximate  the  graph  cut  energy.  This 
paper  will  focus  on  the  graph  Allen-Cahn  equation,  the  L2  gradient  flow  of  the  graph 
Ginzburg-Landau  functional. 

Graph  Laplacians  are  themselves  of  great  interest,  and  have  been  used  in  classical 
machine  learning  algorithms  such  as  PageRank  [12]  and  spectral  clustering  [38].  They 
share  many  similar  characteristics  with  the  continuous  Laplacians  and  can  be  shown 
to  converge  to  continuum  limits  under  suitable  assumptions  [35,39].  Since  they  are 
central  to  this  paper,  we  introduce  below  some  basics  about  them  below. 

Consider  a  weighted  undirected  graph  G  with  vertices  ordered  {1, 2, . . . ,  n}.  Each 
pair  of  vertices  (i,  j)  is  assigned  a  proximity  measure  Wij.  The  weights  Wij  form  a 
weight  matrix  or  adjacency  matrix  of  the  graph  G.  Given  a  weight  matrix  W,  one 
can  construct  three  different  types  of  graph  Laplacians,  namely, 

(1.2)  Lu  =  D  —  W  Unormalized  Laplacian, 

(1.3)  Ls  =  I  —  D~l^2W D^1!2  Symmetric  Laplacian, 

(1.4)  Lrw  =  /  —  D~XW  Random  Walk  Laplacian, 


where  D  is  the  diagonal  matrix  da  =  y~b  wa .  The  unnormalized  graph  Laplacian 
Lu  has  the  following  nice  formula  for  its  Diriclilet  energy,  analogous  to  that  of  the 
continuum  Laplacian: 

(1.5)  ;!,( u,Luu )  =  \  Vi^Ui  “  urf- 

ij 

The  random  walk  Laplacian  Lrw  has  probabilistic  interpretations,  and  can  deal  with 
outliers  nicely  [22].  The  symmetric  Laplacian  Ls  shares  the  same  eigenvalues  with 
Lrw ,  but  is  easier  to  deal  with  computationally  due  to  its  symmetry.  In  particular,  the 
unnormalized  Laplacian  and  the  normalized  Laplacian  have  very  distinct  eigenvectors, 
as  can  be  seen  from  the  visualization  in  Fig. 1.1.  The  visualization  is  a  plot  of  the 
third  eigenvector  for  the  nonlocal  means  graph  formed  from  neighborhood  patches  of 
each  pixel  (see  [5]  for  details).  For  notational  convenience,  we  omit  the  superscript 
for  L  if  the  situation  applies  to  all  versions  of  the  Laplacians.  We  discuss  all  three 
Laplacians  whenever  there  is  a  distinction  to  be  made. 

In  this  paper,  the  only  restrictions  imposed  on  the  weight  matrix  W  are  symmetry 
and  non-negativity.  In  particular,  we  do  not  require  triangle  inequality.  Under  this 
assumption,  a  useful  characterization  of  an  unnornralized  Laplacian  is  given  by: 

Proposition  1.1  (Characterization  of  the  Unnormalized  Graph  Laplacian).  Given 
a  matrix  Lu ,  there  exists  a  weight  matrix  W  such  that  Lu  is  the  corresponding  un¬ 
normalized  graph  Laplacian  if  and  only  if 


(1.6) 


I  Lfj  <  0,  i  j , 

£S--EFr 

1 


Proof.  See  definition  of  Lu  and  D.  □ 
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(a)  Original  Image  (b)  Symmmetric  Laplacian  (c)  Unnormalized  Laplacian 

Fig.  1.1:  Comparison  of  Third  Eigenvector  of  Graph  Laplacians 


We  define  the  Ginzburg-Landau  energy  on  graphs  by  replacing  the  spatial  Lapla¬ 
cian  with  the  graph  Laplacian  L. 

(1.7)  GL{u)  -  |( u,Lu )  +  y  W(uj). 


Note  from  here  on  L  could  be  one  of  three  versions  of  the  graph  Laplacian  .  The  Allen- 
Cahn  equation  on  graphs  is  defined  as  the  gradient  flow  of  the  graph  Ginzburg-Landau 
energy 

(1.8)  ut  =  -VGL(zt)  =  -eLu  -  -W'(u). 

In  [5],  a  semi-implicit  numerical  scheme  was  used  to  counter  the  ill-conditioning 
of  the  graph  Laplacian 

fc+l  _  k  i 

(1.9)  - — =  -eLuk+1  - -W\uk). 

at  e 

Moreover,  a  convex  penalty  |u2  can  be  used  to  form  a  “convex-concave”  split, 
this  gives  us  the  actual  scheme  in  [5] 

ni^~ _  7/^"  1 

(1.10)  - — - =  -eLuk+1  -  cuk+1  +  cuk  -  -W'(uk). 

Convex-concave  splitting  originated  in  an  unpublished  paper  by  Eyre  [17],  and  has 
been  used  to  resolve  long-time  solutions  of  the  Cahn-Hilliard  equation  in  [4,37].  These 
Cahn-Hilliard  equations  were  meant  to  resolve  the  time  dynamics  whereas  here  we 
are  simply  trying  to  find  the  equilibrium  point.  Therefore,  for  the  particular  scheme 

(1.10) ,  we  show  in  the  next  proposition  that  it  is  equivalent  with  (1.9)  and  thus  we 
henceforth  only  analyze  the  scheme  without  convex  splitting  (1.9). 

Lemma  1.2  (Rescaled  Timestep).  The  scheme  (1-10)  is  the  same  as 


(1.11) 


uk+ 1  -  uk  =  - 


dt 


1  +  cdt 


eLuk+1  - 


dt  lTTr/,  k , 
:-W'(uk). 


1  +  cdt  e 


Hence  (1.10)  is  equivalent  to  (1.9)  under  a  rescaling  of  stepsize. 

In  practice,  a  fidelity  term  f((j>(u),u)  is  often  added.  Since  the  Ginzburg-Landau 
functional  is  a  smoothed  version  of  TV,  this  model  is  reminiscent  of  the  classical  ROF 
model  [31]  used  in  imaging  when  the  fidelity  term  is  quadratic.  In  the  next  sections, 
we  analyze  the  scheme  (1.9)  without  fidelity  first,  and  later  incorporate  the  fidelity 
term  in  the  analysis. 
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2.  Maximum  Principle- A00  Estimates.  The  main  result  for  this  section  is 
the  following: 

Proposition  2.1  (A  Priori  Boundeness) .  Define  uk  by 
(2.1)  uk  =  [(1  +  <^-Lu)]~1(uk~1  -  -W'iu*-1)), 

derived  from  (1.9),  and  Lu  is  the  unnormalized  graph  Laplacian  Assume  ||w°||oo  <  1- 
If  dt  <  0.5e,  then  \/k,  ||wfe||oo  <  1- 

Details  of  the  proposition  will  be  covered  in  the  next  two  sections.  What  is  notable 
is  that  the  timestep  restriction  is  independent  of  the  graph,  i.e. ,  this  universal  bound 
is  guaranteed  to  work  for  any  graph  of  any  size.  The  constant  .5e  is  analogous  to 
the  ODE  stiffness  condition(see  [33]).  To  prove  the  result,  we  need  the  maximum 
principle  on  graphs. 

2.1.  Maximum  Principle.  The  classical  maximum  principle  argument  relies 
on  the  fact  that  Aitfip)  >  0  for  Xq  a  local  minimizer.  This  fact  is  also  true  for  graphs. 

Proposition  2.2  (Second  Order  Condition  on  Graphs).  Let  u  be  a  function 
defined  on  a  graph,  and  Lu  be  the  unnormalized  graph  Laplacian  .  Suppose  u  achieves 
a  local  minimum  at  a  vertex  i,  then  Luu |j  <  0,  where  a  local  minimizer  i  is  defined 
as  Ui  <  Uj,  \/wij  >  0. 

Proof.  Let  i  be  a  local  minimizer.  Then 

LUu\i  =  diUi  -  ^2  Wij  U3 
iAi 

Wij(ui  —  uj)  <  0.  □ 

Next,  we  prove  a  discrete  analogue  of  the  continuous  time  maximum  principle,  which 
states  that  the  implicitly  discretized  scheme  for  the  heat  equation  on  graphs  is  de¬ 
creasing  in  the  L^  norm.  This  line  of  thought  is  inspired  by  the  maximum  principle 
for  finite  difference  operators  [13]. 

Proposition  2.3  (Maximum  Principle  for  Discrete  Time).  For  any  dt  >  0,  let 
u  be  a  solution  to 

(2.3)  u  =  — dt  *  (Lu)  +  v , 
then  we  have  ||n||oo  <  IMIoo- 

Proof.  Suppose  i  is  a  maximum  of  u.  Then  since  u(i)  —  dt  *  (—Lu)(i)  +  v(i) 
and  (—Lu)(i)  <  0,  we  have  maxu  =  u(i)  <  v(i)  <  maxv.  Arguing  similarly  with  the 
minimum,  we  have  that  ll'uHoc,  <  H't'Hoo -  D 

Remark:  The  condition  Lij  <  0,  i  j  is  crucial  for  the  above  analysis  to  hold. 
If  L  is  replaced  by  a  general  positive-semidefinite  matrix,  we  still  have  an  L2  version 
of  the  proposition,  namely,  ||ti||2  <  |M|2i  but  the  L°°  version  is  not  true. 

2.2.  Proof  of  Proposition  2.1.  We  use  the  maximum  principle  to  prove  that 
the  sequence  {uk}  is  uniformly  bounded  under  a  suitable  initial  condition.  Consider 
splitting  the  one-line  scheme  (1.9)  into  two  parts. 

,  N  f  vk  =uk  -dt*-W'(uk), 

(2.4)  {  e 

{  uk+1  =  -dt*(eLuuk+1)  +  vk. 
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Note  that  by  the  maximum  principle,  ||wfc+1||oo  <  ||vfc||oo-  We  need  the  lemma 
below  to  control  the  L°°  norm  of  the  first  line  in  (2.4). 

Lemma  2.4.  Define  the  map 

(2.5)  Fdt{x)  =  x  —  dtW'(x)  =  x  —  dtx(x2  —  1). 


If  dt  <  0.5,  Tdt  maps  [—1, 1]  to  itself. 

Proof.  Note  Tdt(h  1)  =  ±1 ,  Vdf,  thus  Tdt  maps  endpoints  to  endpoints.  Tdt  maps 
the  entire  interval  to  itself  if  it  is  monotone  on  the  interval.  Computing  the  roots  r, 

of  T'dt(r)  =  0,  Vi  =  ±^/|(gj  +  1).  Since  Tdt  is  qubic,  it  is  easy  to  see  that  Tdt  is 
monotone  on  [—1, 1]  iff  \rf\  >  1,  i.e. ,  dt  <  0.5.  □ 

Note  that  Tdt(x)  here  is  the  component- wise  map  of  the  first  line  in  (2.4)  and  is 
the  source  of  the  stepsize  restriction.  Since  estimates  for  other  forward  steps  Tdt(x) 
follows  the  same  idea  as  above  and  involves  only  elementary  calculations,  we  omit 
some  of  the  details  later.  Next,  we  prove  our  main  conclusion  below: 

Proof.  [Proposition  2.1]  We  prove  by  induction.  Suppose  IIm^Hoo  <  1-  Then  for 
each  vertex  i,  we  have  vk(i)  =  uk{i)  —  ^W'{uk(i))  =  Tdt/e(uk(i)).  By  Lemma  2.4, 
|ffe(z)|  <  1, Vi,  given  that  *  <  0.5.  Hence  ||t'fc||oo  <  1-  Then  by  Proposition  2.3, 

Halloo  <  Halloo  <  1.  □ 

If  we  are  not  so  keen  on  keeping  uk(i)  in  [—1,1]  for  each  iteration,  but  merely 
care  about  whether  the  scheme  is  bounded  or  not,  then  we  may  relax  the  interval  a 
bit  to  get  a  larger  range  of  dt,  as  the  next  lemma  shows. 

Proposition  2.5.  Fordt  <  2.1,  Tdt  maps  [—1.4, 1.4]  to  itself.  Hence  z/||w0||oo  < 
1.4,  then  {wfc}  is  bounded  for  dt  <  2.1e. 

Derivation  of  the  constants:  Define  <f>(c )  to  be  the  maximum  dt  for  which  Tdt 
maps  [— c,  c]  to  itself.  Since  Tdt  is  qubic,  (f>  is  continuous.  The  exact  constants  are 
then  found  by  using  a  computer  program  to  brute  force  maximize  4>. 

For  future  reference,  the  0.5e  bound  will  be  called  the  “tight  bound”  where  the 
2.1e  bound  will  be  called  the  “loose  bound”.  Again,  we  must  emphasis  that  the  exact 
constants  here  do  not  matter  so  much  as  the  fact  that  they  are  independent  of  the 
graph. 

2.3.  Generalizations  of  the  scheme.  In  this  section,  we  prove  boundedness 
results  for  other  versions  of  the  scheme. 

Proposition  2.6  (Random  walk  graph  Laplacian) .  Let  ||u0||(X)  <  1.  If  dt  <  0.5e, 
the  scheme  (1.9)  with  L  =  Lrw  satisfies  ||wfc||oo  <  l,Vfc. 

Proof.  The  proof  follows  similarly  from  that  of  Proposition  2.1  by  noting  that  the 
second  order  condition  (2.2)  holds  also  for  random  walk  Laplacian  Lrw.  □ 

The  case  on  symmetric  graph  Laplacian  is  a  little  different,  and  a  uniformity 
condition  on  the  graph  must  be  added. 

Proposition  2.7  (Symmetric  graph  laplacian).  Define  the  uniformity  constant 

p  as 


(2.6) 


max;  di 
min.;  di 


If  P  <2,  ||w°||oo  <  ,  dt  <  0.5e,  then  the  sequence  {wfc}  is  bounded. 

Proof.  We  note  that  while  L\^  is  no  longer  negative  (thus  losing  the  maximum 
principle),  we  still  have  the  relation  Ls  =  D1l2LrwD~xl'1 .  Thus  line  2  of  (2.4)  becomes 

D~1/2uk+1  =  -dt  *  LrwD~^2uk+1  +  fl-'/V. 


(2.7) 
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By  the  maximum  principle,  \\D  1/2ufc+1||00  <  \\D  1^2vk\\00.  We  rescale  uk  as  iik  = 
vhnin.j  di  x D~1^2uk+1 , and  vk  =  ^vah\idiXD~1^2vk+1 .  The  first  line  of  (2.4)  becomes 

v\i)  =  -Tdt/e(Ciuk(i)), 

C-i 

where  c*  =  d,  )1^2-  Note  that  by  uniformity  condition  (2.6),  Ci  £  [1  ,-\/2],Vi. 

Next,  we  prove  ||ufc||oo  <  1  by  induction.  This  is  clearly  true  for  k  =  0  since 
1 1 ft0 1 joo  <  v/p||u°||oo  <  1-  For  general  k,  define  <&i(a;)  =  T dt / e{cix)  to  be  a  rescaled 
version  of  the  forward  step.  We  claim  iq  maps  [—1, 1]  to  itself  for  all  i.  This  follows 
easily  from  the  following  lemma: 

Lemma  2.8  (Uniform  scaling).  For  any  dt  <  0.5,  Tut  as  defined  in  (2.5)  maps 
[— c,  c]  to  itself  for  any  c  £  [1,  V2], 

The  proof  of  the  Lemma  2.8  can  be  done  using  brute  force  calculation  similar  to 
that  of  Lemma  2.4.  Thus  we  get  vk(i )  £  [—1,1].  Applying  the  maximum  principle 
and  equation  (2.7),  we  finally  get  ||ufc+1||00  <  1  and  complete  the  induction  argument. 
□ 

In  the  context  of  semi-supervised  learning  [5,21],  a  quadratic  fidelity  term  appears 
in  the  scheme.  We  show  boundedness  of  the  graph  Allen-Cahn  scheme  with  this  added 
fidelity.  Restating  from  [5]  the  scheme  with  fidelity: 


(2g)  f  vk=uk-dt*(^W'(uk)+VA(uk-<t>0)), 

\uk+1  =-dt*(eLuk+1)+vk, 

where  4>°{i)  £  {1,  —1},  for  i  belonging  to  a  fidelity  set  A. 

Proposition  2.9  (Graph  Allen-Cahn  with  fidelity).  The  graph  Allen-Cahn 
scheme  with  fidelity  (2.8)  is  bounded  by  2  for  dt  <  Typ^e,  where  p  is  the  fidelity 
strength  and  e  is  the  diffuse  parameter. 

Proof.  Define  the  modified  forward  step  <&dt ( u )  =  u  —  dt[(u2  —  l)u  +  p(u  —  1)] .  By 
the  same  argument  as  in  Lemma  2.4  by  noting  4>  is  qubic,  the  limit  stepsize  is  given 

by  Thus  dt  =  2+^e-  D 

2.4.  Uniform  Convergence  to  ODE.  As  an  interesting  corollary  of  the  Max¬ 
imum  Principle,  we  prove  that  the  discrete  scheme  uk  converges  to  the  ODE  equation 
in  a  graph  independent  way. 

Proposition  2.10  (Convergence  to  ODE).  Let  U  be  the  continuous  time  solution 
of  (1.8)  defined  on  a  fixed  interval  [0,  T\,  where  L  is  either  the  unnormalized  Laplacian 
or  random  walk  Laplacian.  Assume  ||/7(.,  0)||oo  <  1.  Then  the  semi-implicit  scheme 
in  (1.9)  has  first  order  converges  uniformly  to  the  continuous  time  solution  U  with 
respect  to  L. 

Proof.  Set  Uk  =  U(.,kdt)  Define  ek  =  uk  —  Uk  to  be  the  global  error.  Then  we 
have  to  show  ||efc||  <  Mdt ,  where  M  is  independent  of  dt  and  the  graph  Laplacian  L. 

By  assumption,  ||z<,0||oo  <  1.  Therefore  by  Proposition  2.1  we  have  that  the 
sequence  {uk}  is  uniformly  bounded.  Since  U  is  a  gradient  flow,  GL(U(.,t))  is  de¬ 
creasing  in  time  and  thus  U  is  also  bounded.  Therefore,  we  may  assume  W"  to  be 
bounded  (by  a  graph- free  constant).  In  the  following  notation,  M  and  the  constant 
for  big  O  is  graph- independent.  We  first  compute  the  local  truncation  error: 

Lemma  2.11.  Define  the  local  truncation  error  rfe  as  below: 


jjk+i  _  jjk 


=  - LUk+1 


W'(Uk)  +  Tk. 


(2.9) 


dt 
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Then  Ht^’Hoo  <  Mdt  Subtract  (2.9)  from  the  discrete  scheme,  and  we  get  an  evolution 
equation  for  the  error  term 

Ay  |  X  Ay 

(2.10)  6 - f-6-  =  Lek+1  -  (W'(uk)  -  W'(Uk))  +  rk. 

at 

By  using  Taylor  expansion  on  W7,  we  have 

(2.11)  ek+1  =  -dtLek+1  -  dtW"{^)ek  +  ek  +  rk . 

Here  W"  is  the  diagonal  matrix  IF"  =  \ V"(£j).  By  applying  the  maximum  principle, 
taking  L°°  norm  gives  us 

(2.12)  ||efc+1||oo  <  ||  -  dtW"(£)ek  +  ek  +  <  (1  +  Mdt)\\ek\ 

for  some  graph  independent  constant  M . 

Hence  by  iterating  the  equation  above,  we  get 

||efe||oo  <  ( MeMT)dt  =  0{dt).  □ 

3.  Energy  method- L2  estimates.  In  this  section,  we  derive  estimates  in  terms 
of  the  L2  norm,  commonly  used  to  prove  convergence  of  finite  difference  schemes  of 
parabolic  PDEs.  Our  goal  is  to  prove  convergence  to  stationary  point  for  the  full 
scheme  and  also  convergence  of  scheme  with  spectral  truncation.  Our  proof  is  loosely 
motivated  by  the  analysis  on  convex-concave  splitting  in  [17,40].  For  example  in  [17], 
Eyre  proved: 

Proposition  3.1  (Eyre).  Let  E  =  E\  +  £2  be  a  splitting  with  E\  convex  and 
E2  concave.  Then  for  any  dt,  the  semi-implicit  scheme  uk+1  =  uk  —  dt\/ E\{uk+1)  — 
dtV E2{uk)  is  monotone  in  E,  namely, 

E(uk+1)  <  E{uk ),  Vfc  >  0. 


The  semi-implicit  scheme  in  fact  coincides  with  the  notion  of  proximal  gradient 
method  for  minimizing  the  splitting  E  =  E\  +  E2.  Recall  that  proximal  gradient 
iteration  is  given  as  [6] 

(3.1)  uk+1  =  ProxdtEl  ( uk  -  dfV E2(uk)), 
where  the  Prox  operator  is  defined  as 

ProXjf(x)  =  argminu{f  (u )  +  ||u  —  £||2}- 

2y 

The  Prox  operator  is  defined  for  all  proper  convex  functions  taking  extended  real 
values,  and  its  connection  to  the  semi-implicit  scheme  is  clear  from  its  implicit  gradient 
interpretation.  Namely,  if  y  =  Prox7f(x), 

(3.2)  y  =  x-'ydf{y). 

df  is  the  subgradient  of  /,  which  coincides  with  the  gradient  if  /  is  differentiable. 
Classical  results  for  convergence  of  proximal  gradient  method  can  be  found  in  [6] , which 
requires  both  E\,E2  to  be  convex  (instead  of  E2  concave  as  in  Eyre),  and  WE2  be 
Lipschitz  continuous. 
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In  our  case,  E  =  GL(u ),  E\  =  f (u,  Lu)  and  E2  =  ^  Jf,  W(ui).  However,  our  E2 
is  neither  concave  nor  convex.  Instead  of  using  a  quadratic  penalty  to  force  convex- 
concavity  (which  is  shown  to  be  a  rescaling  of  time  in  (1.2))  we  follow  the  analysis  for 
done  for  non-convex  proximal  gradient  method.  The  estimate  below  is  a  simplified 
version  of  the  proof  in  [32],  and  is  analogous  to  that  in  [17].  General  discussions  of 
non-convex  proximal  gradients  are  found  in  [2]. 

Proposition  3.2  (Energy  Estimate).  Let  E  =  E\  +  E2.  Define  xk+1  by  xk+1  € 
xk  —  dtdE\(xk+1)  —  dtWE2(xk).  Suppose  E\  is  convex,  E2  smooth  and\7E2  Lipschitz 
continuous  with  Lipschitz  constant  M ,  we  have 

(3.3)  E(xk)  -  E(xk+l)  >{jt-  ^-)\\xk+1  -  xkf. 


Proof. 

E(xk)  -  E(xk+1)  =  E^x1*)  -  £i(a;fc+1)  +  E2(xk)  -  E2{xk+1) 

>  (dE1(xk+1),xk  -  xk+1)  +  E2(xk)  -  E2(xk+1) 


+  (Vfi2(/),  xk  -  xk+k)  +  E2(xk)  -  E2(xk+l) 


—  _||T.fc+1  _  rpk  II  2 

~  dt" 

>^IP+1-*‘l|2-ylk',+1-*‘l|2. 


The  second  line  is  by  convexity  of  E\,  the  third  by  plugging  in  the  the  definition  of 
xk+l ,  the  fourth  by  simple  Taylor  expansion  of  the  function  E2.  □ 

Even  though  the  proof  is  simple,  we  have  the  freedom  of  choosing  E\  to  be  a 
general  non-smooth  convex  function.  In  particular,  we  later  set  E\{u)  =  7y(w)  + 
|  (u,Lu),  where  I  is  the  indicator  function  and  V  an  eigenspace. 

3.1.  Convergence  of  Scheme.  In  this  section,  we  use  the  estimate  (3.2)  to 
extend  our  result  of  boundedness  to  convergence  under  the  same  stepsize  restriction. 

Proposition  3.3  (Convergence  of  Graph  Allen-Cahn).  Let  ||u0||oo  <  1.  Under 
the  strict  bound  dt,  <  0.5e,  the  scheme  (1.9)  is  monotone  in  the  Ginzburg- Landau 
energy  GL(u)  =  |(u,  Lu)  +  \  JT  W (uj)  and  converges  to  a  stationary  point  of  GL. 

Proof.  From  Proposition  2.1,  dt  <  0.5e,  implies  H'li^Hoo  <  1  ,Vfe.  We  set  E1  = 
|( u,Lu ),  E2  =  ^Y^i.Wiui),  and  apply  Proposition  (3.2).  Since  the  L°°  ball  in  R” 
is  convex,  all  points  £  that  lie  in  the  line  segment  from  uk  to  ufc+1  satisfy  ||£||oo  < 
1.  Thus  we  can  WLOG  assume  the  Lipschitz  constant  M  of  X7E2  to  satisfy  M  < 
7max|x|oo<1  |W"(a;)|  =  -.  Since  dt  <  0.5e  <  -p,  by  Proposition  3.2,  we  have: 


(3.4)  GL(un)  -  GL(un+1)  >  (i  -  y  )|| uk+1  -  ukf, 

where  >  0  by  our  assumption  on  dt.  Hence  uk  is  monotone  in  GL.  Summing 

both  sides  of  (3.4)  we  obtain 


(3.5)  GL(u°)  -  GL[un)  >  (1  -  f )  £  |K  -  ^-1||2- 

i<n 


Since  GL(un)  >  0  and  jj-  —  >  0,  the  sequence  {ul  —  ul  1}  is  square  summable, 

thus  lim  || ul  —  u*_1||  =  0.  Since  {u1}  is  bounded,  there  exists  a  limit  point  u*  for 


the  sequence,  and  u*  is  unique  by  the  condition  lim  \\uz  —  u 1  1||  =0.  Hence  {u1} 

i—>o o 

converges  to  u*.  By  continuity,  u*  is  a  stationary  point,  i.e. ,  VGL(u*)  =  0.  □ 

Remark:  The  argument  does  not  work  for  the  “loose  bound”  dt  <  2.1e,  since 
the  Lipschitz  restriction  jj  is  no  longer  greater  than  2.1e.  There  indeed  exists  cases 
where  the  sequence  {uk}  is  bounded  but  does  not  converge.  However,  in  practice  the 
scheme  tends  to  converge  under  larger  timestep  than  the  tight  bound  0.5e. 

Following  the  same  line  of  proof,  above  and  using  results  in  Section  2,  we  obtain 
a  general  convergence  result  for  the  graph  Allen-Calm  scheme. 

Theorem  3.4  (Main  Convergence  Result).  Let  uk  be  defined  by  a  form  of  the 
Ginzburg-Landau  scheme  as  below: 

(3.6)  uk+1  =uk~dt*  ( eL*uk+1  +  *  W'(uk )  +  r?A (uk  -  0°)), 

L*  being  either  the  unnormalized  or  random  walk  Laplacian.  Then  if  ||u0||oo  <  1, 
then  3c  independent  of  L  such  that  Vdt  <  c,  we  have  lim  uk  =  u* ,  where  u*  is  a 

stationary  point  of  the  Ginzburg-Landau  functional.  The  result  holds  for  symmetric 
Laplacians  if  we  add  an  additional  uniformity  condition  (2.6)  on  the  graph. 

3.2.  Convergence  Results  for  General  Semi-Definite  L.  In  this  section,  we 
study  the  scheme  (1.9)  where  L  is  replaced  by  an  arbitrary  symmetric  semi-positive 
definite  matrix.  We  desire  to  prove  similar  convergence  results  but  without  the  max¬ 
imum  principle.  Instead,  we  rely  solely  on  the  energy  estimates  itself. 

There  are  two  reasons  for  generalizing  L  to  an  arbitrary  semi-positive  definite 
matrix.  First,  this  serves  as  a  baseline  convergence  result  if  we  are  to  study  symmetric 
perturbations  made  on  the  original  L,  such  as  in  the  case  of  using  the  Nystrom 
Extension.  The  second  reason  is  that  the  proof  here  can  be  carried  over  to  show 
convergence  of  (1.9)  under  spectral  truncation.  Our  main  result  is  below: 

Theorem  3.5.  Let  L  be  an  arbitrary  symmetric  and  semi-positive  definite  matrix 
such  that  pl  <  C  for  some  C  independent  of  N,  where  pl  =  max,;  |  A*| .  Define  uk  by 
the  scheme  (1.9)  with  e  =  1,  i.e., 

j  vk  =  uk  -dt*W'{uk), 

^  \uk+1  =  ~dt*{Luk+1)+vk. 

Suppose  IKHoo  <  1,  then  the  scheme  is  monotone  in  the  Ginzburg-Landau  energy  for 
timestep  dt  =  0(N~l),  where  N  is  the  size  of  the  system,  i.e.,  number  of  vertices  in 
the  graph. 

Since  the  result  is  an  analysis  on  dt  vs  N,  we  allow  the  constants  to  depend  on  e, 
and  WLOG  set  e  =  1  in  the  proof. 

Our  strategy  here  is  to  choose  dt  so  small  and  apply  Proposition  3.2  to  force  mono¬ 
tonicity  in  the  Ginzburg-Landau  functional  GL.  Since  GL(u)  =  0(||tt||4),  bounded¬ 
ness  in  function  value  implies  boundedness  in  the  variable  u.  For  our  purpose,  we 
need  a  more  refined  version  of  the  bound  on  GL(u). 

Lemma  3.6  (Inverse  Bound).  Let  M  be  any  positive  constant.  If  GL(u)  <  M, 
then  || u|| 2  <  N  +  y/NM,  where  N  is  the  dimension  of  u. 

Proof.  By  assumption  on  GL(u)  =  J2i(ui  ~  l)2  +  ~  l)2  ^  M- 

Then  from  the  Cauchy-Schwarz  inequality,  yA  (vf  —  1)  <  V NM,  hence  our  lemma.  □ 
To  prove  the  theorem,  we  need  a  direct  estimate  on  the  norm  of  uk+1  from  uk . 
The  proof  of  the  lemma  is  an  application  of  norm  conversions  in  Lemma  8.1  in  the 
Appendix. 
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Lemma  3.7.  Let  uk  and  uk+1  be  successive  iterates  defined  in  (3.7).  Then  their 
function  value  satisfies  the  inequality  below: 

(3.8)  \\uk+%<(l  +  dt)\\u%+dt\\uk\\32. 


Proof.  Since  L  is  symmetric  semi-positive  definite,  we  have  ||ufe+1||2  <  ||ufc||2. 
Then  since  vk(i)  =  uk{i)  —  dt*  [uk(i)(uk(i)2  —  1)],  we  have  ||ffe||2  <  (1  +  di)||«fc||2  + 
dt\\uk\\l  <  (1  +  dt)\\uk\\2  +  dt||Mfc|||  □ 

We  can  now  prove  the  main  conclusion: 

Proof.  [Proposition  3.5.]  Since  ||w°||oo  <  1,  we  have  ||u°||2  <  VN .  Moreover, 
GL(u°)  <  PL\\u°\\2+J2i<N  1  —  Ci N.  By  Lemma  3.6,  for  any  v  s.t.  GL{v)  <  GL(u°), 
we  have  ||u||  <  C2\fN,  for  some  C2  >  1. 

We  claim  that  there  exists  a  constant  S  independent  of  N  such  that  Vdt  <  SN _1, 
(3.9)  holds  for  all  k.  Note  that  C\  and  C2  are  independent  of  both  N  and  the  interation 
number  k. 

GL(uk)  <GL(u° )  <  C\N, 

(3  9)  \  \  - 

\\uk\\2<C2VN. 

We  argue  by  induction.  The  case  k  =  0  has  been  proved  above.  Suppose  this 
is  valid  for  k,  then  we  have  ||itfe||  <  C2y/N.  By  Lemma  3.7,  we  have  ||itfc+1||  < 
(1  +  dtfN1/2  +  4f-dtN3/2,  where  A\  depends  on  Cl,  C2  but  not  k  or  N.  Therefore, 
we  can  choose  dt  <  SN _1,  such  that  ||itfe+1||  <  A\N 1G.  Again,  5  is  not  dependent  on 
k  or  N . 

The  crux  of  the  proof  is  applying  the  energy  estimate  (3.2)  to  uk+1.  Note  the 
estimate  (3.2)  is  valid  with  constant  M  <  maxy^i  )O<a1V]vI|V2W(0||.  Plugging  in 
the  explicit  formula  W(u)  =  j(w2  —  l)2,  we  have  M  <  A2N  for  some  A2.  Thus  by 
further  shrinking  S  if  necessary,  we  have  for  dt  <  SN~k,  -)~t  —  4f  >0.  Hence  applying 
(3.2),  we  have.  GL{uk+1)  <  GL(uk)  <  GL(u°).  However,  this  would  mean  that 
GL(uk+l )  <  C\N ,  and  thus  by  the  inverse  bound  Lemma  3.6,  ||wfc+1||2  <  C2\^N. 
This  completes  the  induction  step.  □ 

Remark:  The  convergence  stepsize  is  graph-size  dependent.  However,  we  show  in 
the  next  section  that  the  dependence  in  unavoidable  if  we  are  dealing  with  arbitrary 
L. 


4.  Analysis  on  Spectral  Truncation.  In  many  applications,  the  number  of 
nodes  A  on  a  graph  is  huge,  and  it  is  infeasible  to  invert  L  every  iteration  in  (1.9). 
In  [5,28],  a  strategy  proposed  was  to  project  u  onto  the  first  m  eigenvectors.  This 
approach  is  called  spectral  truncation.  In  practice,  spectral  truncation  gives  accurate 
segmentation  results  but  is  computationally  much  cheaper.  There  are  several  methods 
for  precomputing  the  eigenvectors  including  Nystrom  method  [18]  which  is  a  random 
sampling  method,  and  the  Raleigh- Chebyshev  method  [1]  for  sparse  matrices. 

4.1.  Convergence  Results  for  Spectral  Truncation.  Let  us  formally  define 
the  spectral  truncated  version  of  scheme  (1.9).  Define  Vm  =  spanffi1,  </>2, . . . ,  cj)m}  to 
be  space  spanned  by  the  m  eigenvectors  of  L  with  the  smallest  eigenvalues.  Define 
Pm  to  be  the  projection  operator  onto  the  space  Vm.  Then  the  spectral  truncated 
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scheme  is  defined  as 


(4.1) 


vk  =  uk  -  dt  *  -W'(uk), 
e 

uk+1  =  Pm[-dt  *  (eLuk+1)  +  vk]. 


Just  as  in  the  previous  analysis,  we  first  show  boundedness. 

Proposition  4.1  (Boundedness  of  Spectral  Truncation).  Let  the  graph  Laplacian 
satisfy  pl  <  C  for  some  C  independent  of  N.  Define  uk  by  the  scheme  (4-1).  Suppose 
||w°||oo  <  1,  then  the  scheme  is  monotone  in  the  Ginzburg- Landau  energy  for  timestep 
dt  =  0(7V-1),  where  TV  is  the  size  of  the  system. 

The  key  to  the  proof  is  the  following  observation,  which  links  spectral  truncation 
to  the  proximal  gradient  method. 

Lemma  4.2  (Reformulation  of  Spectral  Truncation).  If  u°  £  V.m ,  the  spectral 
truncated  scheme  (4-1)  is  equivalent  to  the  proximal  gradient  scheme  (3.1)  with  E\  = 
§( u,Lu )  +  Ivm,  E2  =  where  7ym  is  the  indicator  function  of  the  m.-th 

eigenspace  which  is  0  inside  Vm  and  +00  outside. 

Proof.  Since  the  forward  step,  i.e.,  line  1  of  (4.1)  is  the  same  between  the  two 
schemes,  we  only  have  to  show  the  following:  Define  u,  u '  as 

u  =argmin  ^ (y,  Ly)  +  ^\\y  -  -c||2, 

u'  =argmin  ^(y,  Ly)  +  IVk(y)  +  jf\\y  ~  v\\2. 

We  want  to  show  that  u,  v!  satisfy  the  relation  Pm(u)  =  v! .  Project  onto  eigenvectors 

N  N 

and  we  have  u  =  ci4>1 ,  u'  =  Since  v!  £  Vm,  we  have  c'  =  0,Vi  >  m. 

i=  1  i—1 

Moreover,  letting  dt  = 

1  N 

|(«,  Lu)  +  —\\ u-  v\\2  =  _  d 

~  i—1 

Thus  the  minimization  is  done  component-wise  in  Cj,  and  it  is  easy  to  see  that  Ci  = 
c',Vi  <  to.  Thus  Pm{u)  =  u.  □ 

We  may  now  prove  our  main  result: 

Proof.  [Proposition  4.1]  The  proof  is  almost  identical  to  that  in  Proposition  3.5. 
We  again  argue  inductively  that  (3.9)  holds  for  dt  <  (57V-1.  Suppose  (3.9)  is  true 
for  k.  Since  the  the  projection  operator  P ^  does  not  increase  L2  norm,  we  still  have 
||Rfc+1||  <  (1  +  cft)||ttfc||  +  dt\\uk\\3,  and  thus  ||ufc+1||  <  C\/N.  By  the  eciuivalence 
of  spectral  truncation  with  proximal  gradient  (4.2),  we  may  use  the  energy  estimate 
(3.2),  and  argue  that  GL(uk+1)  <  GL(uk)  under  dt  <  (57V-1.  This  in  turn  forces 
||wfc+1||  <  Czy/N,  ending  the  induction.  □ 

Since  we  now  know  that  ||itfc||  is  bounded  by  O(vTV),  we  can  WLOG  assume  the 
Lipschitz  constant  M  =  O(N).  Thus  by  following  the  proof  of  Proposition  3.3,  we 
have 

Proposition  4.3  (Convergence  Result).  The  truncated  scheme  is  convergent 
under  the  stepsize  restriction  dt  <  dTV-1. 
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4.2.  A  Counter  Example  for  Graph-Independent  Timestep  Restriction. 

In  the  previous  subsection,  we  proved  that  the  spectral  truncated  scheme  is  bounded 
under  stepsize  restriction  dt  =  0{N~1).  One  would  hope  to  achieve  a  graph- free 
stepsize  rule  as  in  the  case  of  the  full  scheme  (1.9).  However,  as  we  show  in  our 
example  below,  uniform  convergence  stepsize  over  all  graph  Laplacian  of  all  sizes  is 
not  possible. 

PROPOSITION  4.4  (Optimality  of  Estimate  4.1).  For  any  6  independent  of  N  and 
dt  =  SN~a,a  <  1,  we  can  always  find  a  graph  Laplacian  Ljvxiv  with  pl  <  1,  and 
an  initial  condition  ||u0||oo  =  1  such  that  the  scheme  in  (4-1)  with  truncation  level 
m  =  2  has  lim  Hu^'Hoo  =  oo.  Hence  our  estimate  for  the  stepsize  restriction  in  (4-1) 

k — yoo 

is  optimal. 

We  explicitly  construct  a  graph  and  a  graph  Laplacian  to  attain  the  worst  case 
bound.  Graph  construction  is  as  follows,  which  is  illustrated  in  Fig  4.1. 


Fig.  4-1:  Illustration  of  Worst  Case  Graph  with  N  =  7 


Construction  of  Graph 

1.  Nodes :  The  nodes  consists  of  two  groups  of  “clusters”  nodes  (in  circles)  and 
two  outlier  nodes  (in  x).  Each  cluster  contains  N  —  1  nodes  and  thus  the  graph 
contains  a  total  of  2 N  nodes. 

2.  Edge  Weights :  Connect  all  nodes  to  each  other  within  clusters  and  set  edge 
weights  to  1  (black  solid  edges) .  Connect  the  inter  cluster  nodes  in  a  pairwise  fashion 
and  set  weights  to  0.1  (gray  solid  edges).  Finally,  connect  the  outlier  node  with  the 
clusters  and  set  edge  weights  very  close  to  °v'  (gray  dashed  edges),  see  Lemma  8.2  in 
the  Appendix  for  further  explanation. 

3.  Indexing :  Nodes  on  the  left  are  indexed  by  odd  numbers  and  nodes  on  the 
right  even.  The  first  and  the  second  node  correspond  to  the  two  outlier  nodes  respec¬ 
tively. 

4.  Graph  Laplacian:  The  graph  Laplacian  is  taken  to  be  where  L  is  the 
unnormalized  graph  Laplacian  D  —  W . 

We  choose  our  initialization  by  thresholding  the  second  eigenvector,  namely,  u°  = 
Sign(4>2).  The  key  property  of  the  graph  lies  in  its  second  eigenvector,  the  computa¬ 
tion  of  which  could  be  found  in  8.2  in  the  Appendix  : 

Proposition  4.5.  Under  the  setup  above,  the  second  eigenvector  of  the  graph 


12 


Laplacian  is 


±2  __  (\  _i  _i _ L_  _ 1_Y 

9  \2’  2’2 VN'  2VN’^^^,2VN,  2^N  )  ' 

Moreover,  projection  of  u°  onto  the  first  two  eigenvectors  satisfies  P2(u°)  =  C\J N(f>2 , 
where  C  is  approximately  0.5. 

Next,  we  can  simply  use  the  formula  for  the  second  eigenvector  to  show  that  the 
scheme  diverges  for  dt,  =  0(N~a).  The  idea  is  that  after  the  first  iteration,  the  values 
of  u 1  on  the  outlier  nodes  are  high  enough  such  that  the  scheme  diverges.  Below  is 
an  outline  of  the  proof. 

Proof.  [Proposition  4.4]  We  run  through  the  scheme  with  u°  =  Sgn((f>2),  dt  = 
SN~a. 

(Stepl)  We  compute  u1.  Since  u°  is  ±1  valued,  v°  =  u°.  By  boundedness 
of  eigenvalues,  u1  =  Coy/Ntp2  =  0(\fN)(j)2,  since  Co  is  bounded  from  below  with 
respect  to  N. 

(Step  2)  We  compute  u2.  Note  since  ■u1(l)  =  — u1(2)  =  0(\/N),  and  dt  <  N~a, 
wehave|u^|  =  |(1  —  dt)u\  +dt(u\)3\  =  0(y/N)  +  0(N3^2 /N~a)  =  0(Ne/2),  where  9  > 
1.  Similarly,  id(j)  =  0(1)  for  j  ^  2.  Moreover,  by  symmetry,  u1(2 k)  =  —u1(2k  +  1), 
and  hence  we  have  v 1  =  0(Ne/2)(f>2  +  0(N^~ ).  By  calculating  projections  onto  (f> 2 , 
u2  =  0(Ne/2)(j)2  +  O(N^). 

(Step  3)  Inductively,  we  can  show  that  uk  =  0(N^~ )(jr  +  0(N  2  ).  And 

thus  letting  k  — >  oo,  uk  — »  oo.  □ 

Whether  the  theoretical  worst  case  bound  is  attained  if  we  project  to  more  than 
two  eigenvectors  is  not  proved  here  and  could  be  done  in  future  work.  However,  due 
to  level  of  freedom  in  constructing  such  graphs,  the  thought  is  that  there  are  more 
complicated  examples  such  that  the  bound  is  attained  for  truncation  level  m  >  2. 


4.3.  Heuristic  Explanation  for  Good  Typical  Behavior.  Despite  the  patho¬ 
logical  behavior  of  the  example  given  above,  the  timestep  for  spectral  truncation  does 
not  depend  badly  on  the  size  N  in  practice.  In  this  section,  we  attempt  to  give  a 
heuristic  explanation  of  this  from  two  viewpoints. 

The  first  view  is  to  analyze  the  projection  operator  Pm  in  the  L°°  norm.  The 
reason  why  the  maximum  principle  fails  is  because  Pm  is  expansive  in  the  L°°  norm. 
Namely,  for  some  vector  [uHoo  <  1,  we  have  ||Pm(tO||oo  =  0(y/N)  in  the  worst  case. 
However,  an  easy  analysis  shows  the  probability  of  attaining  such  an  0(y/~N)  bound 
decays  exponentially  as  N  grows  large,  as  shown  in  a  simplified  analysis  in  Proposition 
8.3  of  the  Appendix.  Thus  in  practice,  it  is  very  rare  that  adding  Pm  would  violate 
the  maximum  principle  “too  much” . 

The  second  view  is  to  restrict  our  attention  to  data  that  come  from  a  random 
sample.  Namely,  we  assume  our  data  points  x1  are  sampled  i.i.d.  from  a  probability 
distribution  p,  and  that  the  graph  Laplacian  is  computed  from  the  Euclidean  distance 
1 1  a;*  —  ar-7 1 1 .  In  [39],  it  is  proven  that  under  very  general  assumptions,  the  discrete 
eigenfunctions,  eigenvalues  converges  to  continuous  limits  almost  surely.  Moreover, 
the  projection  operators  Pk  converges  compactly  almost  surely  to  their  continuous 
limits.  Moreover,  results  for  continuous  limits  of  graph-cut  problems  can  be  found 
in  [34].  Under  this  set  up,  we  can  define  the  Allen-Cahn  scheme  on  the  continuous 
domain  and  discuss  its  properties  on  suitable  function  spaces.  The  spectral  truncated 
scheme  still  would  not  satisfy  the  maximum  principle,  but  at  least  it  evolves  in  a 
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sample-size  independent  fashion.  Of  course  a  rigorous  proof  would  require  heavy 
functional  analysis. 

5.  Results  for  Multiclass  Classification.  The  previous  analysis  can  be  car¬ 
ried  over  in  a  straight  forward  fashion  to  the  multiclass  Ginzburg-Landau. 

Multiclass  diffuse  interface  algorithm  on  graphs  can  be  found  in  [20,24,28].  In 
most  of  the  algorithms,  the  labels  are  vectorized  into  separate  coordinates.  To  be 
more  precise,  given  K  the  number  of  classes,  and  N  the  number  of  nodes  on  the 
graph,  we  define  an  N  x  K  matrix  u,  where  each  entry  Uij  stands  for  the  “score”  of 
the  ith  node  belonging  to  the  jth  class.  Often  one  would  project  u  onto  the  Gibbs 
simplex  Q  =  =  1| x,  >  0}  [21]  to  make  u^j  into  a  probability  distribution. 

The  Ginzburg  Landau  functional  for  multiclass  is  defined  as 

1  N 

(5.1)  GL(u)  =  ^tr{uLu)  + —’Y^W(ui). 

Here,  Wj  stands  for  the  i-th  row  of  u,  and  IT  should  be  a  “multi-well”  function  that 
is  analogous  to  the  double-well  in  the  binary  case.  The  function  should  have  local 
minima  near  the  unit  vectors  e&  =  (0,  0, . . . ,  1, . . . ,  0)*,  and  grows  fast  when  u  is  far 
from  the  origin.  We  use  the  L 2  double  well,  namely, 

(5.2)  W{ui)  =  {Jl^l\\ui-ek\\l). 

In  [21],  a  different  double  well  is  used  where  L 1  norms  are  taken  instead  of  L2 .  The 
paper  claimed  that  L 2  double  well  suffers  from  the  problem  that  the  function  value 
of  IT  in  the  center  of  the  Gibbs  simplex  is  small.  This  problem  could  be  alleviated 
if  we  rescale  distance  by  a  suitable  function  p(x).  Namely,  replacing  ||iti  —  efeHl  by 
p(\\ui  —  efe || 2 ) -  Moreover,  when  k  is  reasonably  small,  even  such  adjustments  are 
unnecessary.  However,  choosing  the  L 2  well  comes  with  the  bonus  advantage  that 
the  problem  is  smooth.  This  gives  better  convergence  guarantees  as  well  as  makes 
the  problem  easier  to  compute  numerically.  For  example,  the  L2  double  well  does  not 
require  projection  onto  the  Gibbs  simplex  Q  in  every  iteration  as  in  [21]. 

We  minimize  GL  using  the  forward-backward  method  as  in  (2.4). 

(5-3)  2e  i 

[  uk+1  =  -dt  *  (eLuk+1)  +  vk. 

Since  the  diffusion  step  is  done  columns- wise,  the  maximum  principle  carries  over 
naturally.  Namely,  we  have 

Proposition  5.1  (Maximum  Principle  Multiclass).  Let  u,v  be  K  x  N  matrices. 
Define 

(5.4)  u  =  —dt  *  ( eLu )  +  v, 

then  ||u.jj|oo  <  HtjIIocm  where  Uj,Vj  are  the  jth  column  ofu,v  respectively. 

With  the  exact  same  reasoning  as  in  the  binary  case,  we  need  a  range  of  stepsize 
dt  for  which  the  forward  gradient  step  Ft  of  the  “multi-well”  maps  [— R,  R\NxK  onto 
itself,  as  the  next  lemma  shows. 

Lemma  5.2.  Define  Ft  :  Ui  ^  Ui  —  dt  *  ^VlT(ttj).  Then  3R(K)  and  3 c(R,  K) 
independent  of  N  such  that  for  dt  <  c(R1K),  Ft([—R,  R]K)  C  [—  R,  R\K  . 
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Proof.  Since  the  double  well  W  does  not  depend  on  N,  the  constants  R  and  dt 
are  naturally  independent  of  N  if  we  prove  its  existence. 

We  define  a  new  map  fdt.  to  be  fdt(R)  =  sup{||Fdt(u)||oo,  u  G  d[—R,R]K}.  It 
can  be  shown  that  (j>dt  is  continuous.  Since  the  double  well  W  is  nearly  quadratic 
when  R  is  large,  we  have  that  3Ri  such  that  ,r2]k  points  inward  of  the  box 

[—  R-2,  R2]k  ,  for  all  R2  >  R\.  Thus  we  can  find  c  dependent  on  R2  such  that  4>dt(R2)  < 
i?2,VI?2  >  R\,dt  <  c.  Take  R  =  max0c([O,  i?i]),  by  shrinking  c  if  necessary,  we  have 
max<^c([0,  R])  <  R,  and  thus  Tt{[—  R,  R]K)  C  [— R,  R]K ,  Vdf  <  c.  □ 

The  proof  works  for  any  function  that  acts  independently  on  each  component  Ui 
and  has  fast  growth  towards  infinity.  The  estimates  here  are  not  as  precise  as  the 
0.5e  bound  in  the  binary  case,  since  an  explicit  calculation  will  be  a  rather  compli¬ 
cated  formulae  that  involves  I\ .  However,  in  practice,  the  stepsize  restriction  is  also 
comparable  to  0.5e,  at  least  when  the  number  of  clusters  K  is  moderate. 

Additional  fidelity  terms  and  alternative  graph  Laplacian  could  be  handled  the 
same  way  as  in  the  binary  case.  Hence  we  have, 

Theorem  5.3  (Convergence).  The  multiclass  graph  Allen-Cahn  scheme,  with  or 
without  fidelity,  is  convergent  for  stepsize  dt  <  ce. 

6.  Numerical  Results.  In  this  section,  we  construct  various  numerical  experi¬ 
ments  of  increasingly  larger  scales.  This  helps  demonstrate  our  theory,  and  also  have 
some  implication  on  the  real  world  performance  of  the  schemes. 

6.1.  Two  Moons.  The  two  moons  data  was  used  by  Buhler  et  al  [10]  in  ex¬ 
ploring  spectral  clustering  with  p-Laplacians.  It  is  constructed  from  sampling  from 
two  half  circles  of  radius  one  on  R2,  centered  at  (0,0)  and  (1,0.5).  Gaussian  noise 
of  standard  deviation  0.02  in  R100  is  then  added  to  each  of  the  points.  The  weight 
matrix  is  constructed  using  Zelnik-Manor  and  Perona’s  procedure  [41].  Namely,  set 
Wij  =  e~d VTiTi ,  where  r,;  is  the  M th  closest  distance  to  i.  W  is  further  sym¬ 
metrized  by  taking  the  max  between  two  symmetric  entries. 

Fig. 6.1  is  an  illustration  of  the  data  set  of  three  different  sizes  being  segmented 
perfectly  under  a  uniform  stepsize.  A  zero  mass  constraint  is  used  instead  of  fidelity 
points,  and  random  initialization  is  chosen.  The  parameters  for  the  experiment  is 
dt  =  0.5,  e  =  1,  which  is  exactly  the  tight  bound. 


Fig.  6.1:  Segmentation  results  under  the  same  stepsize  for  Two  Moons  with  sample  sizes 
1000,  2000,  3000  respectively. 


To  test  the  theory  in  a  more  rigours  manner,  we  compute  several  “maximum 
stepsizes”  that  ensures  some  criterion  (e.g.  bounded  after  500  iterations,  etc.),  and 
compare  this  with  the  stepsize  predicted  by  the  theory.  Bisection  with  le-5  accuracy 
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Fig.  6.2:  Two  Moons  Segmentation  Problem.  Left:  Maximum  stepsize  satisfying  ||  wfc  ||  oo  <  1. 
Right:  Left:  Maximum  stepsize  satisfying  ||itfc||oo  <  10.  N  is  the  number  of  nodes. 


is  used  to  determine  the  maximum  stepsize  that  satisfies  the  criterion  given. 

Fig  6.2  plots  the  maximum  stepsize  for  the  scheme  (1.9)  to  be  bounded  by  1.0005, 
10  respectively.  Random  —1,1  initial  conditions  are  chosen.  No  fidelity  terms  are 
added  and  the  diffuse  parameter  e  =  1.  We  also  compute  results  for  the  random 
walk  Laplacian  and  the  unnormalized  Laplacian  as  comparison.  The  actual  results 
are  independent  of  graph  size,  and  also  match  the  tight  and  loose  bound  nicely. 

Since  we  are  interested  in  convergence  stepsize,  we  switch  our  criterion  from 
boundedness  to  convergence,  namely,  we  compute  the  stepsizes  for  which  the  scheme 
has  iterative  difference  less  than  le-4  in  1000  iterations,  e  is  still  chosen  to  be  1. 

Fig. 6. 3  (left)  plots  the  limit  convergence  stepsize  for  the  scheme  with  the  three 
different  Laplacians.  As  we  can  see,  the  typical  limit  stepsize  is  between  the  tight  and 
loose  bound.  Fig. 6. 3  (right)  fixes  N  =  2000  and  varies  e  to  plot  the  relation  between 
dt  and  e.  They  are  almost  linear  as  predicted  by  the  0.5e  bound. 


0  500  1000  1500  2000  0  0.2  0.4  0.6  0.8  1 


Fig.  6.3:  Two  Moons  Segmentation  Problem.  Left:  Maximum  stepsize  for  convergence,  fixing 
e  =  1  varying  N .  Right:  Maximum  stepsize  for  convergence,  fixing  N  =  2000  varying  e,  t  is 
the  interface  scale  parameter.  N  is  the  number  of  nodes  on  the  graph. 

Fig. 6. 4  (left)  plots  the  scheme  with  spectral  truncation.  The  results  are  compared 
with  the  full  scheme,  and  are  roughly  in  the  same  range.  Fig. 6.4  (right)  plots  the 
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Fig.  6.4:  Two  Moons  Segmentation  Problem.  Left:  Maximum  convergence  stepsize  comparing 
spectral  truncation  vs  full  scheme.  Right:  Maximum  convergence  stepsize  for  scheme  with 
fidelity. N  the  number  of  nodes,  c  is  the  fidelity  strength. 


effects  of  adding  a  quadratic  fidelity  term  with  power  c  while  keeping  e  =  1  fixed.  As 
we  can  see  from  the  result,  the  fidelity  term  does  constitute  an  additional  restriction 
when  c  is  large.  However,  stepsizes  remain  roughly  the  same  for  small  c.  It  is  hard  to 
analyze  the  exact  effect  when  c  and  e  are  comparable. 

6.2.  Worst  Case  Graph.  Despite  the  good  practical  behavior  of  spectral  trun¬ 
cation,  this  experiment  shown  in  Fig. 6. 5  is  a  realization  of  the  worst  case  stepsize 
restriction  for  spectral  truncation.  The  plot  in  log-log  axis  shows  the  convergent  dt 
vs  the  size  of  the  problem  N.  The  scheme  is  initialized  by  thresholding  the  2nd  eigen¬ 
vector.  The  slope  of  the  descent  matches  the  theoretical  k  =  —  1  line  almost  exactly, 
proving  the  optimality  of  the  theoretical  result. 


Fig.  6.5:  Left:  Cartoon  figure  of  the  worst  case  graph.  Right:  Log  plot  of  maximum  stepsize 
for  {wfc}  to  be  bounded.  N  the  number  of  nodes  on  the  graph. 


6.3.  Two  Cows.  The  point  of  this  experiment  is  to  test  the  effects  of  Nystrom 
sampling  on  the  stepsize  and  overall  performance  of  the  algorithm.  The  images  of 
the  two  cows  are  from  the  Microsoft  Database.  For  large  dense  graphs  such  as  non¬ 
local  graphs  from  images,  it  is  often  impractical  to  compute  the  entire  graph.  Nystrom 
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sampling  is  a  technique  used  to  approximate  eigenvectors  without  explicitly  computing 
the  graph  Laplacian  [3,18,19]. 

From  the  original  312  x  280  image,  we  generate  10  images  with  successively  lower 
resolution  of  (312 /k)  x  (280 /k),  k  =  1,  ...10.  A  non-local  graph  constructed  from 
feature  windows  of  size  7  x  7  is  used,  and  weights  are  constructed  by  the  standard 
Gaussian  Kernel  Wij  =  e~dij'a  .  The  eigenvectors  are  constructed  by  using  Nystrom 
extension,  the  details  of  which  could  be  found  in  [5] . 

Nystrom  extension  produces  an  orthogonal  set  of  vectors  that  approximates  the 
true  eigenvectors  by  subsampling  from  the  original  graph.  The  following  examples 
show  that  this  imprecision  does  not  cause  numerical  instability. 

Fig. 6. 6  illustrates  three  images  with  1,1/2, 1/5  times  original  resolution  being  seg¬ 
mented  under  the  uniform  condition  dt  =  2,  e  =  4.  The  blue  and  red  box  corresponds 
to  fidelity  points  of  the  two  classes,  the  constant  in  front  of  the  fidelity  are  C\  =  1  and 
C2  =  0.4  for  the  cows  and  the  background  respectively. 

Fig. 6. 7  is  a  profile  of  N  vs  dt.  To  ensure  segmentation  quality,  smaller  epsilon 
had  to  be  chosen  for  images  of  lower  resolution,  and  the  final  result  is  displayed  in 
terms  of  the  dt/e  ratio. 


% 


(a)  256  x  256  (b)  128  x  128  (c)  51  x  51 


(d)  Segmentation  Result  1 


(e)  Segmentation  Result  2 


(f)  Segmentation  Result  3 


Fig.  6.6:  Images  of  different  resolution  segmented  under  the  same  stepsize 


6.4.  MNIST.  This  experiment  is  used  to  demonstrate  the  case  of  multiclass 
clustering  by  the  L 2  multiclass  Ginzburg-Landau  functional. 

The  MNIST  database  [26]  is  a  data  set  of  70000  28  x  28  images  of  handwritten 
digits  from  0-9.  The  graph  is  constructed  by  first  doing  a  PCA  dimension  reduc¬ 
tion  and  again  using  the  same  Zelnik-Manor  and  Perona’s  procedure  with  50  nearest 
neighbors. 

For  our  purpose,  we  focus  on  clusters  of  size  three.  Table  6.1  shows  the  limit  step- 
sizes  of  various  tuples,  and  the  error  rate  when  segmented  under  a  uniform  stepsize. 
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Fig.  6. 7:  Maximum.  Convergence  Stepsize  for  Two  Cows  a  Series  of  Different  Resolution.  N 
is  the  number  of  nodes  in  the  graph,  which  equals  Ax  B  with  A,  B  the  height  and  width  of 
an  image. 


5%  fidelity  points  are  used,  and  e  =  1.  The  scheme  is  projected  onto  the  first  100 
eigenvectors.  It  is  shown  here  that  they  are  still  segmented  around  the  same  stepsize. 


Tuples 

{4,6,7} 

{3,5,8} 

{1,0,9} 

{0,6,1} 

{2,7,1} 

Max  dt 

0.5823 

0.5914 

0.5716 

0.5701 

0.5755 

Correct  (dt=0.5) 

97.98% 

97.58% 

96.00% 

96.36% 

98.22% 

Table  6.1:  Clustering  results  of  MNIST.  For  each  digit,  N  «  6000.  First  Row:  triplets  of 
digits  to  be  classified.  Second  Row:  Maximum  stepsize  for  convergence.  Third  Row:  Error 
rate  with  a  fixed  dt  that  is  close  to  the  maximum  stepsize. 


7.  Discussion.  In  summary,  we  show  that  the  semi-implicit  scheme  for  solv¬ 
ing  the  graph  Allen-Cahn  equation  converges  to  a  stationary  point  under  a  graph- 
independent  stepsize  dt  =  0.5e.  The  proof  combines  ideas  form  classical  numerical 
analysis  and  also  convex  analysis.  We  then  analyze  the  same  convergence  stepsize 
problem  for  the  scheme  under  spectral  truncation.  We  show  that  unlike  the  previous 
case,  a  graph-independent  stepsize  bound  that  works  on  all  graphs  is  no  longer  pos¬ 
sible.  This  is  because  maximum  principle  no  longer  holds  under  spectral  truncation. 
A  new  bound  dt  =  0(1V-1)  is  obtained  and  is  shown  to  be  sharp  in  the  worst  case. 
Some  heuristics  were  provided  to  explain  the  discrepancy  between  the  worst  case  per¬ 
formance  and  the  good  average  case  behavior  when  applying  spectral  truncation.  We 
then  present  a  natural  extension  of  the  analysis  to  multi-class  classification.  We  fi¬ 
nally  conduct  a  variety  of  numerical  experiments  on  various  datasets  to  demonstrate 
how  the  theory  matches  practical  performance. 

There  are  still  some  very  interesting  problems  left  to  be  explored.  One  important 
problem  is  why  so  few  eigenvectors  are  needed  during  spectral  truncation?  Is  this 
unique  for  classification  tasks,  and  how  can  this  be  quantified  in  a  more  theoretical 
framework?  Another  problem  is  the  relationship  between  the  stepsize  and  final  error 
rate.  As  the  problem  is  non-convex,  converging  to  a  sub-optinral  stationary  point  is 
a  possibility.  So  far  this  analysis  does  not  attempt  to  characterize  the  quality  of  the 
final  converged  solution,  but  experiments  have  shown  that  the  error  rates  do  differ 
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under  different  stepsize. 

8.  Appendix.  Lemma  8.1  (Norm  Conversions).  Let  1  <  p  <  q  <  +oo.  Then 
the  formula  below  explicitly  states  the  equivalence  between  norms: 

IMI,  <  ll«l|P  <  ll«ll^1/p-1/9- 


Proof.  The  right  hand  side  is  by  a  generalization  of  Holder’s  inequality.  The  left 
hand  side  is  by  simple  polynomial  expansion.  □ 

Lemma  8.2  (Computation  of  Second  Eigenvector  of  Graph  4.1).  The  second 
eigenvector  of  the  graph  in  Fig.  f.l  is 

,2=A  1  _J _ 1_  _! _ 1_Y 

9  \2’  2’2  VN'  2'/n'  ’  2'/n’  2Vn]  ' 


Proof.  We  set  the  gray  solid  edges  have  weights  a,  and  the  gray  dashed  edges 
/3/n.  Recall  the  variational  formulation  of  the  second  eigenvector 

argmin  Dir(u)  s.t.  (u,  e1)  =  0,  ||it||2  =  1. 

U 


Note  that  by  symmetry,  we  can  assume  <f> 2  =  (a,  —a,  b ,  —b, . 
parameterization,  we  have  that  the  Dirichlet  energy  is 


(8.1) 


Dir  (a,  (3)  =  —  (6  —  a)2  x  n  +  /3(2b)2n. 
n 


Hence  by  computing  the  Lagrange  multipliers,  we  have 

|  nkb  =27716  —  (b  —  a), 

(8-2)  \ 

ka  =a  —  0, 


,  6,  —by.  Under  this 


where  7  =  — ,  and  k  is  the  lagrange  multiplier.  The  equation  for  k  is 

(8.3)  fc2  —  (  —  +  2y  +  l)k  +  27  =  0. 

n 

Setting  2y  =  1+0,  and  computing  the  roots  k,  we  have  k  =  1 — (y  ^  —  $  +  +  0)2/4  —  O'2— 

|  —  ^).  Setting  0  =  0  gives  k  =  1  —  However,  we  need  k  =  1  —  to  yield 

our  desired  eigen- vector.  This  can  be  done  by  setting  the  correction  term  0  =  o(-).  □ 
Proposition  8.3.  Define  the  set 

M  =  {u  e  |  IMloo  <  ljinaxUPmulloo  >  C\/N}, 

where  Pm  is  any  projection  operator  onto  a  subspace,  and  0  <  C  <  1.  Then  the 
volumefwith  respect  to  the  standard  L2  metric  in  RN )  of  the  set  M  decreases  expo¬ 
nentially  with  respect  to  the  number  of  dimensions  N. 

The  proposition  shows  that  if  u  were  sampled  uniformly  from  a  unit  cube,  then 
the  probability  of  some  projection  Pm  expanding  the  max  norm  by  a  factor  of  0(y/N) 
is  exponentially  decreasing  in  N. 


20 


Fig.  8.1:  Illustration  of  Proposition  8.3.  S  is  one  of  the  “caps”  that  vn  resides  in.  un  and 
vn  have  angle  less  than  6. 


Proof.  Let  u  £  M.  Then  by  definition  of  the  set  M,  3  some  projection  Pm  such 
that  ||Pmu||oo  >  Cy/N.  Define  v  :=  Pmu  and  vn  :=  tt^.  Define  un  :=  Since 

vn  is  the  projected  direction  of  u,  Pmu  =  {u,vn)vn.  Then  we  have 

CjN  <  \\PmUWoo  =  (w,Un)||f„||oo  =  \\u\\ 2  ||fr>  ||oo  (un,  Vn)  • 

Since  ||u||2  <  JN,  we  have 

(8.4)  ||Vn||oo(Wn,Wn)  >  C. 

Since  (un,vn)  <  1,  the  projected  direction  vn  must  be  in  the  set  S  =  {v  |  ||c||2  = 
1,  H^Hoo  >  cry.  However,  the  set  S  contains  the  N  “caps”  of  a  unit  sphere  (see  Fig. 8.1), 
and  hence  is  exponentially  decreasing  in  volume  with  respect  to  the  sphere.  On  the 
other  hand,  since  H^Hoo  <  1,  by  (8.4)  we  have  (■ un,vn )  >  (7,  and  thus  u  lies  in  a  cone 
K(vn)  with  angle  cos(0)  >  C.  Hence  u  G  Kv  +  A7”,  and  since  cones  Kv  have  volume 
exponentially  decreasing  with  respect  to  N  as  well,  we  have  Vol(M)  is  exponentially 
decreasing  with  respect  to  N. 

□ 
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